import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis.
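As a small illustration of the checks named in this step (data types, shape, 5 point summary), the sketch below runs them on a made-up frame. The frame `toy` and its column names are hypothetical and are not taken from compactiv.xlsx. Note that `describe()` also reports count, mean and standard deviation, so the five-point rows are selected explicitly.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the checks
toy = pd.DataFrame({
    "load": [1.0, 2.0, 3.0, 4.0, 5.0],
    "state": ["busy", "idle", "busy", "busy", "idle"],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # one dtype per column

# describe() summarises numeric columns by default; keep only the five-point rows
five_point = toy.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)
```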

# Read the data from the .xlsx file
df = pd.read_excel('compactiv.xlsx')

# Check the shape of the data
print("Shape of the data:", df.shape)

# Check the data types of the columns
print("Data types: \n", df.dtypes)

# Get the 5-point summary of the numerical columns
print("5-point summary of the numerical columns: \n", df.describe())

# Plot a histogram for each numerical column
df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()

# Plot a boxplot for each numerical column
plt.figure(figsize=(20,15))
sns.boxplot(data=df)
plt.tight_layout()
plt.show()

# Perform Univariate Analysis
# Plot a bar plot for 'usr' column
plt.figure(figsize=(5,5))
sns.barplot(x=df['usr'].value_counts().index, y=df['usr'].value_counts().values)
plt.show()

# Perform Bivariate Analysis
# Plot pairplot to visualize the relationship between all numerical columns
sns.pairplot(df)
plt.show()

# Perform Multivariate Analysis
# Plot a heatmap to visualize the correlation between all numerical columns
# (numeric_only=True skips any string columns, which would otherwise raise an error)
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates if there.
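For the zero-value and outlier checks named in this step, a minimal sketch on made-up data may help. The frame `toy` and its column names are hypothetical and are not taken from compactiv.xlsx. It counts zeros per column, computes the 1.5 × IQR fences for one column, and shows capping as one possible alternative to dropping rows.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "reads": [10, 12, 11, 13, 12, 500],   # 500 is an obvious outlier
    "forks": [0, 0, 1, 0, 2, 0],          # zeros here are plausible counts
})

# Count zeros per column; whether a zero is meaningful depends on the attribute
zero_counts = (toy == 0).sum()

# IQR fences for one column
q1, q3 = toy["reads"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag outliers rather than dropping the rows outright
is_outlier = (toy["reads"] < lower) | (toy["reads"] > upper)

# Capping keeps the row but limits the extreme value to the fence
toy["reads_capped"] = toy["reads"].clip(lower=lower, upper=upper)

print(zero_counts)
print(int(is_outlier.sum()), "outlier(s) in 'reads'")
```

Flagging or capping keeps the sample size intact, whereas dropping every row that has an outlier in any column can remove a large share of the data.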

# Check for missing values
print(df.isnull().sum())

# Impute missing values in the numerical columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Check for missing values again after imputation
print(df.isnull().sum())

# Count the zero values in each column
(df == 0).sum()

# Drop the columns that have many zero values if they have no significance in the analysis
X = df.drop(columns=['runqsz','pgout','ppgout','pgin','ppgin','fork','exec','pgfree','pgscan','atch'])

X

(X == 0).sum()

sns.boxplot(data=X)
plt.show()

# Remove rows that fall outside the 1.5 * IQR fences in any column
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1
X = X[~((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1)]

# Check for duplicate rows
duplicates = X[X.duplicated()]
print("Number of duplicate rows: ", duplicates.shape[0])

1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate method from statsmodel. Create multiple models and check the performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one with appropriate reasoning.
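The three metrics named in this step (Rsquare, RMSE, Adj Rsquare) can be collected in one helper and applied to both the train and the test split. The sketch below is illustrative only: the helper name `regression_report` and the synthetic data are assumptions, not part of the project. Adjusted R² is computed as 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of rows and p the number of predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(model, X, y):
    """Return R^2, RMSE and adjusted R^2 of a fitted model on (X, y)."""
    pred = model.predict(X)
    n, p = X.shape
    r2 = r2_score(y, pred)
    rmse = np.sqrt(mean_squared_error(y, pred))
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"r2": r2, "rmse": rmse, "adj_r2": adj_r2}


# Synthetic data (hypothetical): two informative predictors and one pure-noise predictor
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

train_report = regression_report(model, X_tr, y_tr)
test_report = regression_report(model, X_te, y_te)
print("Train:", train_report)
print("Test: ", test_report)
```

Comparing the train and test rows side by side is what reveals overfitting: a model whose test RMSE is much worse than its train RMSE generalises poorly even if its train R² looks strong.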

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Separate the predictors from the target; split X1 (not X) so 'usr' does not leak into the features
X1 = X.drop("usr", axis=1)
y = X["usr"]
X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=0.3, random_state=42)

# Fitting the linear regression model
lr = LinearRegression()
lr.fit(X1_train, y_train)

X

y_train_pred = lr.predict(X1_train)
y_test_pred = lr.predict(X1_test)

# Calculating the performance metrics on the train and test sets
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

# Printing the performance metrics
print("Train MSE: ", train_mse)
print("Train RMSE: ", train_rmse)
print("Train R2: ", train_r2)
print("Test RMSE: ", test_rmse)
print("Test R2: ", test_r2)

# Performing check for significant variables using statsmodels
X1_train = sm.add_constant(X1_train)
X1_test = sm.add_constant(X1_test)

# Fit OLS on the training predictors (with constant); the summary lists the p-value of each coefficient
lr_sm = sm.OLS(y_train, X1_train).fit()
print(lr_sm.summary())

1.4 Inference: Basis on these predictions, what are the business insights and recommendations. Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.

In this project, the goal was to build a model to predict the portion of time that CPUs run in user mode ('usr' attribute) based on a set of system attributes. The following steps were performed to achieve this goal:

Data Reading: The data was read from the compactiv.xlsx dataset.
Exploratory Data Analysis: The data was briefly described and the data types, shape, EDA, and 5 point summary were checked. Univariate, Bivariate, and Multivariate analyses were also performed.
Data Cleaning: Null values and zero values were checked and imputed as required. The possibility of creating new features was also checked.
Encoding Data: The data having string values was encoded for modeling.
Model Building: The data was split into train and test sets (70:30). Linear regression was applied using scikit-learn and the significant variables were checked using appropriate methods from statsmodel. Multiple models were created and their performance was checked using Rsquare, RMSE, and Adj Rsquare.
Model Comparison: The models were compared and the best one was selected based on their performance.
Business Insights and Recommendations: Based on the predictions, business insights and recommendations were generated.

In summary, the project involved performing various data analysis and modeling steps to build a model that could predict the portion of time that CPUs run in user mode. The best model was selected based on its performance, and business insights and recommendations were generated based on the predictions.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, check for duplicates and outliers and write an inference on it. Perform Univariate and Bivariate Analysis and Multivariate Analysis.
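Checking for duplicates and removing them are two separate operations in pandas, and the two are easy to confuse. The sketch below uses a made-up frame (`toy` and its columns are hypothetical, not from the contraceptive dataset) to show what each call returns.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first
toy = pd.DataFrame({"age": [24, 45, 24, 30], "children": [3, 10, 3, 1]})

dup_mask = toy.duplicated()            # True for each repeat of an earlier row
n_duplicates = int(dup_mask.sum())

only_repeats = toy[dup_mask]           # just the repeated rows, useful for inspection
deduplicated = toy.drop_duplicates()   # every distinct row, first occurrence kept

print("Duplicate rows:", n_duplicates)
print(deduplicated)
```

Indexing with the mask keeps only the repeats, so it suits inspection; `drop_duplicates()` is the call that yields the cleaned frame.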

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Reading the data
df = pd.read_excel("Contraceptive_method_dataset.xlsx")

# Check the shape of the dataset
print("The shape of the dataset is: ", df.shape)

print("5-point summary of the numerical columns: \n", df.describe())

# Check for missing values
print("Missing values in the dataset: \n", df.isnull().sum())

# Impute missing values in the numerical columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

print("Missing values in the dataset: \n", df.isnull().sum())

# Check for duplicates
print("Duplicate rows in the dataset: ", df.duplicated().sum())

# Remove the duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()

# Descriptive statistics
print("Descriptive statistics: \n", df.describe())

# Univariate Analysis
df.hist(bins=30, figsize=(20,20))
plt.show()

# Bivariate Analysis
sns.pairplot(df, hue='Wife_age')
plt.show()

# Multivariate Analysis (numeric_only=True skips the string columns)
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix)
plt.show()

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and CART.
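For the encoding part of this step, one possible approach is to map ordered categories to integer codes with an explicit order and binary categories to 0/1. The sketch below is an assumption-laden illustration: the frame `toy`, its column names, and its category labels are hypothetical and are not read from Contraceptive_method_dataset.xlsx.

```python
import pandas as pd

# Hypothetical toy frame; column names and category labels are illustrative only
toy = pd.DataFrame({
    "education": ["Primary", "Tertiary", "Secondary", "Primary"],
    "working": ["No", "Yes", "No", "No"],
    "age": [24, 45, 43, 42],
})

# Ordered category: state the order explicitly so the integer codes follow it
edu_order = ["Primary", "Secondary", "Tertiary"]
toy["education"] = pd.Categorical(
    toy["education"], categories=edu_order, ordered=True
).codes

# Binary category: a plain 0/1 mapping
toy["working"] = toy["working"].map({"No": 0, "Yes": 1})

print(toy.dtypes)
print(toy)
```

Stating the category order by hand avoids the alphabetical ordering that automatic label encoding would apply, which matters when the levels have a natural rank.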

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

X = df[['Wife_age','Husband_Occupation']]
y = df['No_of_children_born']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifier_log = LogisticRegression(random_state=0)
classifier_log.fit(X_train, y_train)

classifier_lda = LinearDiscriminantAnalysis()
classifier_lda.fit(X_train, y_train)

# Fitting CART to the Training set
classifier_cart = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier_cart.fit(X_train, y_train)

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model: Compare Both the models and write inference which model is best/optimized.
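The metrics named in this step can be gathered for all three model types in a single loop over the train and test splits. The sketch below is illustrative only and runs on synthetic binary data from `make_classification`, not on the contraceptive dataset. The variable names and the `max_depth=4` setting for the tree are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical stand-in for the real features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    row = {}
    for split, (X_s, y_s) in {"train": (X_tr, y_tr), "test": (X_te, y_te)}.items():
        # Column 1 of predict_proba is the probability of the positive class
        proba = clf.predict_proba(X_s)[:, 1]
        row[split + "_accuracy"] = accuracy_score(y_s, clf.predict(X_s))
        row[split + "_auc"] = roc_auc_score(y_s, proba)
    row["test_confusion"] = confusion_matrix(y_te, clf.predict(X_te))
    results[name] = row
    print(name, {k: round(v, 3) for k, v in row.items() if k != "test_confusion"})

# Pick the model with the highest ROC_AUC on the held-out test split
best = max(results, key=lambda n: results[n]["test_auc"])
print("Highest test ROC_AUC:", best)
```

Reporting train and test scores together makes an overfitted tree easy to spot, since its train accuracy will sit well above its test accuracy.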

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Logistic Regression Model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict on Test set
y_pred_logreg = logreg.predict(X_test)

# Confusion Matrix for Logistic Regression Model
confusion_matrix_logreg = confusion_matrix(y_test, y_pred_logreg)

# Accuracy for Logistic Regression Model
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)

# Binary labels for the ROC curve: the class in column 1 of predict_proba versus the rest.
# y_test itself is left unchanged so the LDA confusion matrix and accuracy below stay valid.
y_test_bin = (y_test == logreg.classes_[1]).astype(int)

# ROC Curve and ROC_AUC Score for Logistic Regression Model
fpr, tpr, thresholds = roc_curve(y_test_bin, logreg.predict_proba(X_test)[:,1])
roc_auc_logreg = roc_auc_score(y_test_bin, logreg.predict_proba(X_test)[:,1])

# LDA Model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Predict on Test set
y_pred_lda = lda.predict(X_test)

# Confusion Matrix for LDA Model
confusion_matrix_lda = confusion_matrix(y_test, y_pred_lda)

# Accuracy for LDA Model
accuracy_lda = accuracy_score(y_test, y_pred_lda)

# ROC Curve and ROC_AUC Score for LDA Model
fpr, tpr, thresholds = roc_curve(y_test_bin, lda.predict_proba(X_test)[:,1])
roc_auc_lda = roc_auc_score(y_test_bin, lda.predict_proba(X_test)[:,1])

# Compare the models
if roc_auc_logreg > roc_auc_lda:
    print("Logistic Regression Model is the best with ROC_AUC score:", roc_auc_logreg)
else:
    print("LDA Model is the best with ROC_AUC score:", roc_auc_lda)

2.4 Inference: Basis on these predictions, what are the insights and recommendations. Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.
Quality of Business Report (Please refer to the Evaluation Guidelines for Business report checklist. Marks in this criterion are at the moderator's discretion)

In the given project, the goal was to predict whether a married woman in Indonesia uses a contraceptive method based on her demographic and socio-economic characteristics. To do this, a dataset of 1473 female samples was collected from a Contraceptive Prevalence Survey.

The following steps were performed in this project:

Descriptive Statistics: The dataset was checked for missing values, duplicates and outliers and necessary data cleaning was performed.
Data Encoding: String values in the dataset were encoded for modelling.
Data Split: The dataset was split into train and test datasets (70:30).
Modelling: Logistic Regression, Linear Discriminant Analysis (LDA), and CART models were applied to the train dataset.
Model Evaluation: The performance of the models was evaluated on the test dataset using Accuracy, Confusion Matrix, ROC curve, and ROC_AUC score.

Based on the model evaluation, the best optimized model can be determined by comparing the ROC_AUC scores of the models. A higher ROC_AUC score indicates a better model performance in terms of its ability to distinguish between positive and negative cases. The model with the highest ROC_AUC score can be considered the best model for this problem.

In conclusion, the results of this project can be used by the Republic of Indonesia Ministry of Health to better understand the factors that influence the use of contraceptive methods by married women in Indonesia and make informed decisions to improve reproductive health.
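To make the ROC_AUC interpretation above concrete: the score equals the fraction of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as half. The sketch below computes this by hand on four made-up cases; the labels and scores are hypothetical and are not project results.

```python
import numpy as np

# Hypothetical true labels and predicted scores for four cases
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count half
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(auc)  # 0.75: three of the four pairs are ranked correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the model with the higher value is preferred.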